Abstract

As automated image analysis progresses, there is increasing interest in richer linguistic annotation of pictures, with attributes of objects (e.g., furry, brown ...) attracting most attention. By building on the recent “zero-shot learning” approach, and paying attention to the linguistic nature of attributes as noun modifiers, and specifically adjectives, we show that it is possible to tag images with attribute-denoting adjectives even when no training data containing the relevant annotation are available. Our approach relies on two key observations. First, objects can be seen as bundles of attributes, typically expressed as adjectival modifiers (a dog is something furry, brown, etc.), and thus a function trained to map visual representations of objects to nominal labels can implicitly learn to map attributes to adjectives. Second, objects and attributes come together in pictures (the same thing is a dog and it is brown). We can thus achieve better attribute (and object) label retrieval by treating images as “visual phrases”, and decomposing their linguistic representation into an attribute-denoting adjective and an object-denoting noun. Our approach performs comparably to a method exploiting manual attribute annotation, it outperforms various competitive alternatives in both attribute and object annotation, and it automatically constructs attribute-centric representations that significantly improve performance in supervised object recognition.

* Current affiliation: Thomas J. Watson Research Center, IBM, gdinu@us.ibm.com

1 Introduction

As the quality of image analysis algorithms improves, there is increasing interest in annotating images with linguistic descriptions ranging from single words describing the depicted objects and their properties (Farhadi et al., 2009; Lampert et al., 2009) to richer expressions such as full-fledged image captions (Kulkarni et al., 2011; Mitchell et al., 2012). This trend has generated wide interest in linguistic annotations beyond concrete nouns, with the role of adjectives in image descriptions receiving, in particular, much attention.

Adjectives are of special interest because of their central role in so-called attribute-centric image representations. This framework views objects as bundles of properties, or attributes, commonly expressed by adjectives (e.g., furry, brown), and uses the latter as features to learn higher-level, semantically richer representations of objects (Farhadi et al., 2009).^1 Attribute-based methods achieve better generalization of object classifiers with less training data (Lampert et al., 2009), while at the same time producing semantic representations of visual concepts that more accurately model human semantic intuition (Silberer et al., 2013). Moreover, automated attribute annotation can facilitate finer-grained image retrieval (e.g., searching for a rocky beach rather than a sandy beach) and provide the basis for more accurate image search (for example in cases of visual sense disambiguation (Divvala et al., 2014), where a user disambiguates their query by searching for images of wooden cabinet as furniture and not just cabinet, which can also mean council).

^1 In this paper, we assume that, just as nouns are the linguistic counterpart of visual objects, visual attributes are expressed by adjectives. An informal survey of the relevant literature suggests that, when attributes have linguistic labels, they are indeed mostly expressed by adjectives. There are some attributes, such as parts, that are more naturally expressed by prepositional phrases (PPs: with a tail). Interestingly, Dinu and Baroni (2014) showed that the decomposition function we will adopt here can derive both adjective-noun and noun-PP phrases, suggesting that our approach could be seamlessly extended to visual attributes expressed by noun-modifying PPs.

Classic attribute-centric image analysis requires, however, extensive manual and often domain-specific annotation of attributes (Vedaldi et al., 2014), or, at best, complex unsupervised image-and-text-mining procedures to learn them (Berg et al., 2010). At the same time, resources with high-quality per-image attribute annotations are limited; to the best of our knowledge, coverage of all publicly available datasets containing non-class-specific attributes does not exceed 100 attributes,^2 orders of magnitude smaller than the equivalent object-annotated datasets (Deng et al., 2009). Moreover, many visual attributes currently available (e.g., 2D-boxy, furniture leg), albeit visually meaningful, do not have straightforward linguistic equivalents, rendering them inappropriate for applications requiring natural linguistic expressions, such as the search scenarios considered above.

A promising way to limit manual attribute annotation effort is to extend recently proposed zero-shot learning methods, until now applied to object recognition, to the task of labeling images with attribute-denoting adjectives. The zero-shot approach relies on the possibility to extract, through distributional methods, semantically effective vector-based word representations from text corpora, on a large scale and without supervision (Turney and Pantel, 2010). In zero-shot learning, training images labeled with object names are also represented as vectors (of features extracted with standard image-analysis techniques), which are paired with the vectors representing the corresponding object names in language-based distributional semantic space. Given such paired training data, various algorithms (Socher et al., 2013; Frome et al., 2013; Lazaridou et al., 2014) can be used to induce a cross-modal projection of images onto linguistic space. This projection is then applied to map previously unseen objects to the corresponding linguistic labels. The method takes advantage of the similarities in the vector space topologies of the two modalities, allowing information propagation from the limited number of objects seen in training to virtually any object with a vector-based linguistic representation.

^2 The attribute datasets we are aware of are the ones of Farhadi et al. (2010), Ferrari and Zisserman (2007) and Russakovsky and Fei-Fei (2010), containing annotations for 64, 7 and 25 attributes, respectively. (This count excludes the SUN Attributes Database (Patterson et al., 2014), whose attributes characterize scenes rather than concrete objects.)

Figure 1: t-SNE (Van der Maaten and Hinton, 2008) visualization of 3 objects together with the 2 nearest attributes in our visual space (left), and of the corresponding nouns and adjectives in linguistic space (right).

To adapt zero-shot learning to attributes, we rely on their nature as (salient) properties of objects, and on how this is reflected linguistically in modifier relations between adjectives and nouns. We build on the observation that visual and linguistic attribute-adjective vector spaces exhibit similar structures: The correlation ρ between the pairwise similarities in visual and linguistic space of all attributes-adjectives from our experiments is 0.14 (significant at p < 0.05).^3 While the correlation is smaller than for object-noun data (0.23), we conjecture it is sufficient for zero-shot learning of attributes. We will confirm this by testing a cross-modal projection function from attributes, such as colors and shapes, onto adjectives in linguistic semantic space, trained on pre-existing annotated datasets covering less than 100 attributes (Experiment 1).

We proceed to develop an approach achieving equally good attribute-labeling performance without manual attribute annotation. Inspired by linguistic and cognitive theories that characterize objects as attribute bundles (Murphy, 2002), we hypothesize that when we learn to project images of objects to the corresponding noun labels, we implicitly learn to associate the visual properties/attributes of the objects to the corresponding adjectives. As an example, Figure 1 (left) displays the nearest attributes of car, bird and puppy in the visual space and, interestingly, the relative distance between the noun denoting objects and the adjective denoting attributes is also preserved in the linguistic space (right).

^3 In this paper, we report significance at the α = 0.05 threshold.

Figure 2: Images tagged with orange and liqueur are mapped in linguistic space closer to the vector of the phrase orange liqueur than to the orange or liqueur vectors (t-SNE visualization) (the figure also shows the nearest neighbours of phrase, adjective and noun in linguistic space). The mapping is trained using solely noun-annotated images.


We further observe that, as also highlighted by recent work in object recognition, any object in an image is, in a sense, a visual phrase (Sadeghi and Farhadi, 2011; Divvala et al., 2014), i.e., the object and its attributes are mutually dependent. For example, we cannot visually isolate the object drum from attributes such as wooden and round. Indeed, within our data, in 80% of the cases the projected image of an object is closer to the semantic representation of a phrase describing it than to either the object or attribute labels. See Figure 2 for an example.

Motivated by this observation, we turn to recent work in distributional semantics defining a vector decomposition framework (Dinu and Baroni, 2014) which, given a vector encoding the meaning of a phrase, aims at decoupling its constituents, producing vectors that can then be matched to a sequence of words best capturing the semantics of the phrase. We adopt this framework to decompose image representations projected onto linguistic space into an adjective-noun phrase. We show that the method yields results comparable to those obtained when using attribute-labeled training data, while only requiring object-annotated data. Interestingly, this decompositional approach also doubles the performance of object/noun annotation over the standard zero-shot approach (Experiment 2). Given the positive results of our proposed method, we conclude with an extrinsic evaluation (Experiment 3); we show that attribute-centric representations of images created with the decompositional approach boost performance in an object classification task, supporting claims about its practical utility.

In addition to contributions to image annotation, 
our work suggests new test beds for distributional 
semantic representations of nouns and associated 
adjectives, and provides more in-depth evidence of 
the potential of the decompositional approach. 

2 General experimental setup 
2.1 Cross-Modal Mapping 

Our approach relies on cross-modal mapping from a visual semantic space V, populated with vector-based representations of images, onto a linguistic (distributional semantic) space W of word vectors. The mapping is performed by first inducing a function $f_{proj}: \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ from data points $(v_i, w_i)$, where $v_i \in \mathbb{R}^{d_1}$ is a vector representation of an image tagged with an object or an attribute (such as dog or metallic), and $w_i \in \mathbb{R}^{d_2}$ is the linguistic vector representation of the corresponding word. The mapping function can subsequently be applied to any given image $v_i \in V$ to obtain its projection $w'_i \in W$ onto linguistic space:

$$w'_i = f_{proj}(v_i)$$

Specifically, we consider two mapping methods. In the Ridge regression approach, we learn a linear function $F_{proj} \in \mathbb{R}^{d_2 \times d_1}$ solving the Tikhonov-Phillips regularization problem, which minimizes the following objective:

$$\|W^{Tr} - F_{proj} V^{Tr}\|_2^2 + \|\lambda F_{proj}\|_2^2$$

where $W^{Tr}$ and $V^{Tr}$ are obtained by stacking the word vectors $w_i$ and corresponding image vectors $v_i$ from the training set.^4

^4 The parameter λ is determined through cross-validation on the training data.
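
The Ridge mapping has a closed-form solution, which the following minimal sketch illustrates. It assumes image and word vectors are stored as matrix columns; the function name `fit_ridge_projection` and the argument `lam` are hypothetical, and `lam` plays the role of λ² in the objective above. It is an illustration under these assumptions, not the implementation used in the experiments.

```python
import numpy as np

def fit_ridge_projection(V, W, lam):
    """Closed-form ridge solution F = W V^T (V V^T + lam * I)^-1.

    V: (d1, n) image vectors as columns.
    W: (d2, n) word vectors as columns, paired with V.
    lam: non-negative penalty weight (hypothetical name).
    Returns F with shape (d2, d1), so that F @ v projects an image
    vector v onto linguistic space.
    """
    d1 = V.shape[0]
    gram = V @ V.T + lam * np.eye(d1)
    # gram is symmetric, so solving gram @ F.T = V @ W.T yields F
    # without forming an explicit inverse.
    return np.linalg.solve(gram, V @ W.T).T
```

A larger `lam` shrinks the entries of the learned matrix towards zero, trading fit on the training pairs for stability.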
















Second, motivated by the success of Canonical Correlations Analysis (CCA) (Hotelling, 1936) in several vision-and-language tasks, such as image and caption retrieval (Gong et al., 2014; Hardoon et al., 2004; Hodosh et al., 2013), we adapt normalized Canonical Correlations Analysis (nCCA) to our setup. Given two paired observation matrices X and Y, in our case $W^{Tr}$ and $V^{Tr}$, CCA seeks two projection matrices A and B that maximize the correlation between $A^T X$ and $B^T Y$. This can be solved efficiently by applying SVD to

$$C_{XX}^{-1/2} \, C_{XY} \, C_{YY}^{-1/2} = U \Sigma V^T$$

where C stands for the covariance matrix. Finally, the projection matrices are defined as $A = C_{XX}^{-1/2} U$ and $B = C_{YY}^{-1/2} V$. Gong et al. (2014) propose a normalized variant of CCA, in which the projection matrices are further scaled by some power λ of the singular values Σ returned by the SVD solution. In our experiments, we tune the choice of λ on the training data. Trivially, if λ = 0, nCCA reduces to CCA.
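
The SVD-based solution can be sketched as follows. The sketch assumes that observations are stored as columns, that a small ridge term `reg` (not part of the description above) is added to the covariance matrices for numerical stability, and that "scaling by a power of the singular values" means multiplying each projection column by the corresponding singular value raised to `lam`. All names are hypothetical.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def fit_ncca(X, Y, lam=0.0, reg=1e-8):
    """Return projection matrices A, B and the singular values S.

    X: (dx, n) and Y: (dy, n), columns are paired observations.
    lam: power of the singular values used for scaling (lam=0 gives CCA).
    reg: stabilizing term added to the covariances (an assumption).
    """
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    n = X.shape[1]
    Cxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T / n
    Cxx_is, Cyy_is = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, S, Vt = np.linalg.svd(Cxx_is @ Cxy @ Cyy_is, full_matrices=False)
    scale = np.diag(S ** lam)
    A = Cxx_is @ U @ scale
    B = Cyy_is @ Vt.T @ scale
    return A, B, S
```

With `lam=0` the scaling matrix is the identity, so the function returns the plain CCA projections, and the cross-covariance of the projected views is diagonal with the canonical correlations on the diagonal.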

Note that other mapping functions could also be used. We leave a more extensive exploration of possible alternatives to further research, since the details of how the vision-to-text conversion is conducted are not crucial for the current study. As increasingly more effective mapping methods are developed, we can easily plug them into our architecture.

Through the selected cross-modal mapping function, any image can be projected onto linguistic space, where the word (possibly of the appropriate part of speech) corresponding to the nearest vector is returned as a candidate label for the image (following standard practice in distributional semantics, we measure proximity by the cosine measure).
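
The retrieval step amounts to a cosine nearest-neighbour search, optionally restricted to one part of speech. A minimal sketch, in which the function name, the `allowed` filter and the toy data layout are all assumptions made for illustration:

```python
import numpy as np

def nearest_words(query, word_matrix, vocab, k=5, allowed=None):
    """Return the k words whose vectors are closest to `query` by cosine.

    query: (d,) projected image vector.
    word_matrix: (n_words, d) word vectors, one row per entry of vocab.
    vocab: list of words aligned with the rows of word_matrix.
    allowed: optional set of words to keep (e.g. adjectives only).
    """
    q = query / np.linalg.norm(query)
    M = word_matrix / np.linalg.norm(word_matrix, axis=1, keepdims=True)
    sims = M @ q
    order = np.argsort(-sims)  # highest cosine first
    ranked = [vocab[i] for i in order
              if allowed is None or vocab[i] in allowed]
    return ranked[:k]
```

Passing a set of adjectives as `allowed` corresponds to retrieving attribute labels, while a set of nouns corresponds to retrieving object labels.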


2.2 Decomposition 

Dinu and Baroni (2014) have recently proposed a general decomposition framework that, given a distributional vector encoding a phrase meaning and the syntactic structure of that phrase, decomposes it into a set of vectors expected to express the semantics of the words that composed the phrase. In our setup, we are interested in a decomposition function $f_{Dec}: \mathbb{R}^{d_2} \to \mathbb{R}^{2 d_2}$ which, given a visual vector projected onto the linguistic space, assumes it represents the meaning of an adjective-noun phrase, and decomposes it into two vectors corresponding to the adjective and noun constituents: $[w_{adj}; w_{noun}] = f_{Dec}(w_{AN})$. We take $f_{Dec}$ to be a linear function and, following Dinu and Baroni (2014), we use as training data vectors of adjective-noun bigrams directly extracted from the corpus together with the concatenation of the corresponding adjective and noun word vectors. We estimate $f_{Dec}$ by solving a ridge regression problem minimizing the following objective:

$$\|[W^{Tr}_{adj}; W^{Tr}_{noun}] - F_{dec} W^{Tr}_{AN}\|_2^2 + \|\lambda F_{dec}\|_2^2$$

where $W^{Tr}_{adj}$, $W^{Tr}_{noun}$ and $W^{Tr}_{AN}$ are the matrices obtained by stacking the training data vectors. The λ parameter is tuned through generalized cross-validation (Hastie et al., 2009).
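
The decomposition function can be estimated with the same closed-form ridge solution, using the concatenated adjective and noun vectors as regression targets. The sketch below assumes vectors are stored as columns and uses a fixed penalty `lam` instead of generalized cross-validation; the function names are hypothetical.

```python
import numpy as np

def fit_decomposition(W_an, W_adj, W_noun, lam):
    """Estimate a linear map from phrase vectors to [adjective; noun].

    W_an, W_adj, W_noun: (d, n) matrices whose i-th columns are the
    vectors of the i-th adjective-noun bigram and of its two words.
    Returns F_dec with shape (2d, d).
    """
    targets = np.vstack([W_adj, W_noun])           # (2d, n)
    d = W_an.shape[0]
    gram = W_an @ W_an.T + lam * np.eye(d)
    return np.linalg.solve(gram, W_an @ targets.T).T

def decompose(F_dec, w_phrase):
    """Split the output of F_dec into adjective and noun vectors."""
    out = F_dec @ w_phrase
    d = w_phrase.shape[0]
    return out[:d], out[d:]
```

At test time, `decompose` would be applied to an image vector already projected onto linguistic space, and each half would be matched against adjectives and nouns separately.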


2.3 Representational Spaces 

Linguistic Space We construct distributional vectors from text through the method recently proposed by Mikolov et al. (2013), to which we feed a corpus of 2.8 billion words obtained by concatenating English Wikipedia, ukWaC and BNC.^5 Specifically, we used the CBOW algorithm, which induces vectors by predicting a target word given the words surrounding it. We construct vectors of 300 dimensions considering a context window of 5 words to either side of the target, setting the sub-sampling option to 1e-05 and the negative sampling parameter to 5.^6

Visual Spaces Following standard practice, images are represented as bags of visual words (BoVW) (Sivic and Zisserman, 2003).^7 Local low-level image features are clustered into a set of visual words that act as higher-level descriptors. In our case, we use PHOW-color image features, a variant of dense SIFT (Bosch et al., 2007), and a visual vocabulary of 600 words. Spatial information is preserved with a two-level spatial pyramid representation (Lazebnik et al., 2006), achieving a final dimensionality of 12,000. The entire pipeline is implemented using the VLFeat library (Vedaldi and Fulkerson, 2010), and its setup is identical to the toolkit’s basic recognition sample application.^8 We apply Positive Pointwise Mutual Information (Evert, 2005) to the BoVW counts, and reduce the resulting vectors to 300 dimensions using SVD.

^5 http://wacky.sslmit.unibo.it, http://www.natcorp.ox.ac.uk

^6 The parameters are tuned on the MEN word similarity dataset (Bruni et al., 2014).

^7 In future research, we might obtain a performance boost simply by using the more advanced visual features recently introduced by Krizhevsky et al. (2012).

Category    Attributes
Color       black, blue, brown, gray, green, orange, pink, red, violet, white, yellow
Pattern     spotted, striped
Shape       long, round, rectangular, square
Texture     furry, smooth, rough, shiny, metallic, vegetation, wooden, wet

Table 1: List of attributes in the evaluation dataset.

Attributes              Object
furry, white, smooth    cat
green, shiny            cocktail

Table 2: Sample annotations from the evaluation dataset.

          Training                    Evaluation
          #im.     #attr.   #obj.     #im.     #attr.   #obj.
Exp. 1    10,749   97       -         leave-one-attribute-out
Exp. 2    23,000   -        750       8,449    25       203

Table 3: Summary of training and evaluation sets.


2.4 Evaluation Dataset 


For evaluation purposes, we use the dataset consisting of images annotated with adjective-noun phrases introduced in Russakovsky and Fei-Fei (2010), which pertains to 384 WordNet/ImageNet synsets with 25 images per synset. The images were manually annotated with 25 attribute-denoting adjectives related to texture, color, pattern and shape, respecting the constraints that a color must cover a significant part of the target object, and all other attributes must pertain to the object as a whole (as opposed to parts). Table 1 lists the 25 attributes and Table 2 illustrates sample annotations.^9

In order to increase annotation quality, we only consider attributes with full annotator consensus, for a total of 8,449 annotated images, with 2.7 attributes per image on average. Furthermore, to make the linguistic annotation more natural and avoid sparsity problems, we renamed excessively specific objects with a noun denoting a more general category, following recent work on entry-level categories (Ordonez et al., 2013); e.g., colobus guereza was relabeled as monkey. The final evaluation dataset contains 203 distinct objects.

^8 http://www.vlfeat.org/applications/apps.html

^9 Although vegetation is a noun, we have kept it in the evaluation set, treating it as an adjective.


3 Experiment 1: Zero-shot attribute learning

In Section 1 we showed that there is a significant correlation between pairwise similarities of adjectives in a language-based distributional semantic space and those of visual feature vectors extracted from images labeled with the corresponding attributes. In the first experiment, we test whether this correspondence in attribute-adjective similarity structure across modalities suffices to successfully apply zero-shot labeling. We learn a cross-modal function from an annotated dataset and use it to label images from an evaluation dataset with attributes outside the training set. We will refer to this approach as DIR^A, for Direct Retrieval using Attribute annotation. Note that this is the first time that zero-shot techniques are used in the attribute domain. In the present evaluation, we distinguish DIR^A-Ridge and DIR^A-nCCA, according to the cross-modal function used to project from images to linguistic representations (see Section 2.1 above).


3.1 Cross-modal training and evaluation 

To gather sufficient data to train a cross-modal mapping function for attributes/adjectives, we combine the publicly available datasets of Farhadi et al. (2009) and Ferrari and Zisserman (2007) with attributes and associated images extracted from MIR-FLICKR (Huiskes and Lew, 2008).^10 The resulting dataset contains 72 distinct attributes and 2,300 images. Each image-attribute pair represents a training data point $(v, w_{adj})$, where $v$ is the vector representation of the image, and $w_{adj}$ is the linguistic vector of the attribute (corresponding to an adjective). No information about the depicted object is needed.

^10 We filtered out attributes not expressed by adjectives, such as wheel or leg.




































[Figure 3 legend: DIR^A-Ridge, DIR^A-nCCA, Dec, Russakovsky and Fei-Fei (2010)]

Figure 3: Performance of zero-shot attribute classification (as measured by AUC) compared to the supervised method of Russakovsky and Fei-Fei (2010), where available. The dark-red horizontal line marks chance performance.


To further maximize the amount of training data points, we conduct a leave-one-attribute-out evaluation, in which the cross-modal mapping function is repeatedly learned on all 72 attributes from the training set, as well as all but one attribute from the evaluation set (Section 2.4), and the associated images. This results in 72 + (25 − 1) = 96 training attributes in total. On average, 45 images per attribute are used. The performance is measured for the single attribute that was excluded from training. A numerical summary of the experiment setup is presented in the first row of Table 3.


3.2 Results and discussion 


Russakovsky and Fei-Fei (2010) trained separate SVM classifiers for each attribute in the evaluation dataset in a cross-validation setting. This fully supervised approach can be seen as an ambitious upper bound for zero-shot learning, and we directly compare our performance to theirs using their figure of merit, namely area under the ROC curve (AUC), which is commonly used for binary classification problems.^11 A perfect classifier achieves an AUC of 1, whereas an AUC of 0.5 indicates random guessing. For purposes of AUC computation, DIR^A is considered to label test images with a given adjective if the linguistic-space distance between their mapped representation and the adjective is below a certain threshold. AUC measures the aggregated performance over all thresholds. To get a sense of what AUC compares to in terms of precision and recall, the AUC of DIR^A for furry is 0.74, while the precision is 71% and the corresponding recall 14%. For the more difficult blue case, AUC is at 0.5, precision and recall are 2% and 55%, respectively.

^11 Table 4 reports hit@k results for DIR^A, which will be discussed below in the context of Experiment 2.
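
Aggregating over all thresholds is equivalent to estimating the probability that a randomly chosen image with the attribute receives a higher score than a randomly chosen image without it. The following sketch computes AUC in this pairwise form; it assumes that each image is scored by the cosine similarity between its projected vector and the adjective vector, and the function name is hypothetical.

```python
import numpy as np

def auc_from_scores(scores, labels):
    """Area under the ROC curve from per-image scores.

    scores: higher means the image is more likely to have the attribute
            (e.g. cosine between projected image and adjective vector).
    labels: 1/True if the image is annotated with the attribute.
    Ties between a positive and a negative image count as half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]   # all positive-negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)
```

The function expects at least one positive and one negative image; the pairwise formulation is quadratic in the number of images, which is acceptable for a dataset of this size.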

The AUC results are presented in Figure 3 (ignore red bars for now). We observe first that, of the two mapping functions we considered, Ridge (blue bars) clearly outperforms nCCA (yellow bars). According to a series of paired permutation tests, Ridge has a significantly larger AUC in 13/25 cases, nCCA in only 2. This is somewhat surprising given the better performance of nCCA in the experiments of Gong et al. (2014). However, our setup is quite different from theirs: They perform all retrieval tasks by projecting the input visual and language data onto a common multimodal space different from both input spaces. nCCA is a well-suited algorithm for this. We aim instead at producing linguistic annotations of images, which is most straightforwardly accomplished by projecting visual representations onto linguistic space. Regression-based learning (in our case, via Ridge) is a more natural choice for this purpose.

Coming now to a more general analysis of the results, as expected, and analogously to the supervised setting, DIR^A-Ridge performance varies across attributes. Some achieve performance close to the supervised model (e.g., rectangular or wooden) and, for 18 out of 25, the performance is well above chance (bootstrap test). The exceptions are: blue, square, round, vegetation, smooth, spotted and striped. Interestingly, for the last 4 attributes in this list, Russakovsky and Fei-Fei (2010) achieved their lowest performance, attributing it to the lower quality of the corresponding image annotations. Furthermore, Russakovsky and Fei-Fei (2010) excluded 5 attributes due to insufficient training data. Of these, our performance for blue, vegetation and square is not particularly encouraging, but for violet and pink we achieve more than 0.7 AUC, at the level of the supervised classifiers, suggesting that the proposed method can complement the latter when annotated data are not available.

For a different perspective on the performance of DIR^A, we took several objects and queried the model for their most common attribute, based on the average attribute rank across all images of the object in the dataset. Reassuringly, we learn that sunflowers are on average yellow (mean rank 23), fields are green (4.4), cabinets are wooden (4) and vans metallic (6.6) (strawberries are, suspiciously, blue, 2.7).

Overall, this experiment shows that, just like object classification, attribute classifiers benefit from knowledge transfer between the visual and linguistic modalities, and zero-shot learning can achieve reasonable performance on attributes and the corresponding adjectives. This conclusion is based on the assumption that per-image annotations of attributes are available; in the following section, we show how equal and even better performance can be attained using data sets annotated with objects only, therefore without any hand-coded attribute information.


4 Experiment 2: Learning attributes from objects and visual phrases

Having shown that reasonably accurate annotations of unseen attributes can be obtained with zero-shot learning when a small amount of manual annotation is available, we now proceed to test the intuition, preliminarily supported by the data in Figure 1, that, since objects are bundles of attributes, attributes are implicitly learned together with objects. We thus try to induce attribute-denoting adjective labels by exploiting only widely-available object-noun data. At the same time, building on the observation illustrated in Figure 2 that pictures of objects are pictures of visual phrases, we experiment with a vector decomposition model which treats images as composite and derives adjective and noun annotations jointly. We compare it with standard zero-shot learning using direct label retrieval as well as against a number of challenging alternatives that exploit gold-standard information about the depicted objects. The second row of Table 3 gives a numerical summary of the setup for this experiment.

4.1 Cross-modal training

We now assume object annotations only, in the form of training data $(v, w_{noun})$, where $v$ is the vector representation of an image tagged with an object and $w_{noun}$ is the linguistic vector of the corresponding noun. To ensure high imageability and diversity, we use as training object labels those appearing in the CIFAR-100 dataset (Krizhevsky, 2009), combined with those previously used in the work of Farhadi et al. (2009), as well as the most frequent nouns in our corpus that also exist in ImageNet, for a total of 750 objects-nouns. For each object label, we include at most 50 images from the corresponding ImageNet synset, resulting in ≈ 23,000 training data points. Images containing objects from the evaluation dataset are excluded, so that both adjective and noun retrieval adhere to the zero-shot paradigm.

4.2 Object-agnostic models

DIR^O The Direct Retrieval using Object annotation approach projects an image onto the linguistic space and retrieves the nearest adjectives as candidate attribute labels. The only difference with DIR^A (more precisely, DIR^A-Ridge), the zero-shot approach we tested above, is that the mapping function has been trained on object-noun data only.


Dec The Decomposition method uses the $f_{Dec}$ function inspired by Dinu and Baroni (2014) (see Section 2.2), to associate the image vector projected onto linguistic space to an adjective and a noun. We train $f_{Dec}$ with about 50,000 training instances, selected based on corpus frequency. These data are further balanced by not allowing more than 100 training samples for any adjective or noun in order to prevent very frequent words such as other or new from dominating the training data. No image data are used, and there is no need for manual annotation, as the adjective-noun tuples are automatically extracted from the corpus.

At test time, given an image to be labeled, we project its visual representation onto the linguistic space and decompose the resulting vector $w'$ into two candidate adjective and noun vectors: $[w'_{adj}; w'_{noun}] = f_{Dec}(w')$. We then search the linguistic space for adjectives and nouns whose vectors are nearest to $w'_{adj}$ and $w'_{noun}$, respectively.


4.3 Object-informed models 


A cross-modal function trained exclusively on object-noun data might be able to capture only prototypical characteristics of an object, as induced from text, independently of whether they are depicted in an image. Although the gold annotation of our dataset should already penalize this image-independent labeling strategy (see Section 2.4), we control for this behaviour by comparing against three models that have access to the gold noun annotations of the image and favor adjectives that are typical modifiers of the nouns.


LM We build a bigram Language Model by using the Berkeley LM toolkit (Pauls and Klein, 2012)^12 on the one-trillion-token Google Web 1T corpus^13 and smooth probabilities with the “Stupid” backoff technique (Brants et al., 2007). Given an image with object-noun annotation, we score all attributes-adjectives based on the language-model-derived conditional probability p(adjective|noun). All images of the same object produce identical rankings. As an example, among the top attributes of cocktail we find heady, creamy and fruity.


vLM LM does not exploit visual information about the image to be annotated. A natural way to enhance it is to combine it with DIR^O, our cross-modal mapping adjective retrieval method. In the visually-enriched Language Model, we interpolate (using equal weights) the ranks produced by the two models. In the resulting combination, attributes that are both linguistically sensible and likely to be present in the given image should be ranked highest. We expect this approach to be challenging to beat. MacKenzie (2014) recently introduced a similar model in a supervised setting, where it improved over standard attribute classifiers.
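
Rank interpolation with equal weights can be sketched as follows. The sketch assumes both models rank the same set of adjectives and breaks ties alphabetically; the tie-breaking rule and the function name are assumptions made for illustration.

```python
def interpolate_ranks(ranking_a, ranking_b, weight=0.5):
    """Combine two rankings of the same adjectives by weighted mean rank.

    ranking_a, ranking_b: lists of adjectives, best candidate first
                          (e.g. from the language model and from DIR^O).
    weight: weight of ranking_a; 0.5 gives equal weights.
    Returns the adjectives sorted by combined rank, best first.
    """
    rank_a = {adj: r for r, adj in enumerate(ranking_a)}
    rank_b = {adj: r for r, adj in enumerate(ranking_b)}
    combined = {adj: weight * rank_a[adj] + (1 - weight) * rank_b[adj]
                for adj in rank_a}
    # Ties are broken alphabetically (an assumption of this sketch).
    return sorted(combined, key=lambda adj: (combined[adj], adj))
```

An adjective ends up near the top only if neither model ranks it very low, which is the intended effect of combining linguistic plausibility with visual evidence.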



        LM    SP    vLM   DIR^O   Dec   DIR^A
@1       2     0      5       1    10       7
@5       5     7     16       4    31      23
@10      8     9     29       9    44      37
@20     18    17     50      19    59      51
@50     33    32     72      43    81      68
@100    56    55     82      67    89      77

Table 4: Percentage hit@k attribute retrieval scores.


SP The Selectional Preference model robustly
captures semantic restrictions imposed by a noun on
the adjectives modifying it (Erk et al., 2010).
Concretely, for each noun denoting a target object, we
identify a set of adjectives ADJ_noun that co-occur
with it in a modifier relation more than 20 times.
By averaging the linguistic vectors of these adjectives,
we obtain a vector w_noun, which should
capture the semantics of the prototypical adjectives
for that noun. Adjectives that have higher similarity
with this prototype vector are expected to denote
typical attributes of the corresponding noun and will
be ranked as more probable attributes. Similarly to
LM, all images of the same object produce identical
rankings. As an example, among the top attributes
of cocktail we find fantastic, delicious and perfect.
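The SP ranking can be sketched as follows, assuming adjective vectors and adjective-noun modifier counts are given as plain dictionaries. The two-dimensional vectors, the counts and the function names are hypothetical, and cosine is assumed as the similarity measure.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sp_ranking(noun, modifier_counts, adjective_vectors, min_count=20):
    """Rank all adjectives by similarity to the noun's prototype adjective vector."""
    typical = [
        adj
        for adj, count in modifier_counts[noun].items()
        if count > min_count and adj in adjective_vectors
    ]
    prototype = mean_vector([adjective_vectors[adj] for adj in typical])
    return sorted(
        adjective_vectors,
        key=lambda adj: cosine(adjective_vectors[adj], prototype),
        reverse=True,
    )


# Toy data, invented for illustration only.
adjective_vectors = {
    "delicious": [1.0, 0.1],
    "fruity": [0.9, 0.3],
    "wooden": [0.0, 1.0],
}
modifier_counts = {"cocktail": {"delicious": 50, "fruity": 25, "wooden": 3}}

ranking = sp_ranking("cocktail", modifier_counts, adjective_vectors)
print(ranking)
```

Only adjectives that modify the noun more than 20 times contribute to the prototype, but every adjective in the space is ranked against it.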


4.4 Results 

We evaluate the performance of the models on
attribute-denoting adjective retrieval, using a search
space containing the top 5,000 most frequent adjectives
in our corpus. Tables 4 and 5 present
hit@k and recall@k results, respectively (k ∈
{1, 5, 10, 20, 50, 100}). Hit@k measures the percentage
of images for which at least one gold attribute
exists among the top k retrieved attributes.
Recall@k measures the proportion of gold attributes
retrieved among the top k, relative to the total number
of gold attributes for each image.
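The two measures can be made concrete with a short sketch. It assumes that each test image comes with a ranked list of retrieved adjectives and a set of gold attributes, and that recall is averaged over images; the function names and the toy data are illustrative.

```python
def hit_at_k(ranked, gold, k):
    """True if at least one gold attribute appears among the top k retrieved items."""
    return any(item in gold for item in ranked[:k])


def recall_at_k(ranked, gold, k):
    """Fraction of the image's gold attributes found among the top k retrieved items."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)


def percentage(values):
    """Average a list of per-image values and express it as a percentage."""
    return 100.0 * sum(values) / len(values)


# Toy test set: (ranked adjectives, gold attributes) per image, invented for illustration.
images = [
    (["white", "furry", "brown"], {"white", "brown"}),
    (["crunchy", "round", "shiny"], {"shiny", "round"}),
]

hit1 = percentage([hit_at_k(ranked, gold, 1) for ranked, gold in images])
recall2 = percentage([recall_at_k(ranked, gold, 2) for ranked, gold in images])
print(hit1, recall2)
```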

First of all, we observe that LM and SP, the two
models that have access to gold object-noun annotation
and are entirely language-based, although well
above the random baseline (k/5,000), achieve rather
low performance. This confirms that, to model our
test set accurately, it is not sufficient to predict typical
attributes of the depicted objects.


https://code.google.com/p/berkeleylm/
https://catalog.ldc.upenn.edu/LDC2006T13


Due to the leave-one-attribute-out approach used to train
and test DIR_A, it is not possible to compute
recall results for this model.




























        LM   SP  vLM  DIR_O  Dec
@1       1    0    2      0    4
@5       2    3    7      2   15
@10      3    5   15      4   23
@20      9   10   30      9   35
@50     20   20   49     22   59
@100    35   34   61     44   70

Table 5: Percentage recall@k attribute retrieval scores.



        DIR_O  Dec  DIR_A
@1          1    2      0
@5          3   10      0
@10         5   14      1
@20         9   20      2
@50        20   29      6
@100       33   41  [...]

Table 6: Percentage hit@k noun retrieval scores.


The DIR_O method, which exploits visual information,
performs numerically similarly to the
object-informed models LM and SP, with better
hit and recall at the larger cutoffs (k = 50 and 100).
Although worse than DIR_A, the relatively high
performance of DIR_O is a promising result, suggesting
that object annotations together with linguistic
knowledge extracted in an unsupervised manner from
large corpora can replace, to some extent, manual
attribute annotations. However, DIR_O does not
directly model any semantic compatibility constraints
between the retrieved adjectives and the object
present in the image (see examples below). Hence,
the object-informed model vLM, which combines
visual information with linguistic co-occurrence
statistics, at least doubles the performance of
DIR_O, LM and SP for k up to 20.

Our Dec model, which treats images as visual
phrases and jointly decouples their semantics,
outperforms even vLM by a large margin. It also
outperforms DIR_A, the standard zero-shot learning
approach using attribute-adjective annotated data (see
also the attribute-by-attribute AUC comparison
between Dec, DIR_A and the fully-supervised
approach of Russakovsky and Fei-Fei (2010) in the
accompanying figure).

Interestingly, accounting for the phrasal nature of
visual information leads to substantial performance
improvement in object recognition through zero-shot
learning (i.e., tagging images with the depicted
nouns) as well. Table 6 provides the hit@k results
obtained with the DIR_O and Dec methods for the
noun retrieval task in a search space of the 10,000 most
frequent nouns from our corpus. Note that DIR_O
represents the label retrieval technique that has been
standardly used in conjunction with zero-shot learning
for objects: The cross-modal function is trained
on images annotated with nouns that denote the objects
they depict, and it is then used for noun label
retrieval of unseen objects through a nearest neighbor
search of the mapped image representation (the
DIR_A column shows that zero-shot noun retrieval
using the mapping function trained on adjectives
works very poorly). Dec instead decomposes the
mapped image representation into two vectors denoting
adjective and noun semantics, respectively,
and uses the latter to perform the nearest neighbor
search for a noun label. Although not directly
comparable, the results of Dec reported here are in
the same range as those of state-of-the-art zero-shot
learning models for object recognition (Frome et al., 2013).


Image 1 (gold labels) A: white, brown; N: dog
Image 2 (gold labels) A: shiny, round; N: syrup

Image  Model  Top item                 Top hit (Rank)
1      Dec    A: white; N: dog         white (1); dog (1)
1      DIR_O  A: animal; N: goat       white (27); dog (25)
1      LM     A: stray                 brown (74)
1      vLM    A: pet                   brown (17)
2      Dec    A: shiny; N: flan        shiny (1); syrup (170)
2      DIR_O  A: crunchy; N: ramekin   shiny (15); syrup (113)
2      LM     A: chocolate             shiny (84)
2      vLM    A: chocolate             shiny (17)

Table 7: Images with gold attribute-adjective and object-noun
labels, and highest-ranked items for each model
(Top item), as well as highest-ranked correct item and
rank (Top hit). Noun results for (v)LM are omitted since
these models have access to the gold noun label.

Annotation examples Table 7 presents some interesting
patterns we observed in the results. The
first example illustrates the case in which conducting
adjective and noun retrieval independently results in
mixing information, which damages the DIR_O
approach: Adjectival and nominal properties are not
decoupled properly, since the animal property of the
depicted dog is reflected in both the animal adjective
and the goat noun. At the same time, the whiteness
of the object (an adjectival property) influences
noun selection, since goats tend to be white. Instead,
Dec unpacks the visual semantics in an accurate
and meaningful way, producing correct attribute and
noun annotations that form acceptable phrases. LM
and vLM are negatively affected by co-occurrence
statistics and guess stray and pet as adjectives, both
typical but generic and abstract dog properties.

In the next example, DIR_O predicts a reasonable
noun label (ramekin), focusing on the container
rather than the liquid it contains. By ignoring the
relation between the adjective and the noun, the
resulting adjective annotation (crunchy) is semantically
incompatible with the noun label, emphasizing
the inability of this method to account for semantic
relations between attributes-adjectives and object-nouns.
Dec, on the other hand, mistakenly annotates
the object as flan instead of syrup. However,
having captured the right general category of the object
(“smooth gelatinous items that reflect light”),
it ranks a semantically appropriate and correct
attribute (shiny) at the top. Finally, LM and vLM
choose chocolate, an attribute semantically appropriate
for syrup but irrelevant for the target image.

Semantic plausibility of phrases The examples
above suggest that one fundamental way in which
Dec improves over DIR_O is by producing semantically
coherent adjective-noun combinations. More
systematic evidence for this conjecture is provided 
by a follow-up experiment on the linguistic qual¬ 
ity of the generated phrases. We randomly sampled 
2 images for each of the 203 objects in our data 
set. For each image, we let the two models gen¬ 
erate 9 descriptive phrases by combining their re¬ 
spective top 3 adjective and noun predictions. From 
the resulting lists of 3,654 phrases, we picked the 
200 most common ones for each model, with only 
1/8 of these common phrases being shared by both. 
The selected phrases were presented (in random or¬ 
der and concealing their origin) to two linguistically- 
sophisticated annotators, who were asked to rate 
their degree of semantic plausibility on a 1-3 scale 
(the annotators were not shown the corresponding 
images and had to evaluate phrases purely on lin¬ 
guistic/semantic grounds). Since the two judges 
were largely in agreement (ρ = 0.63), we averaged
their ratings. The mean averaged plausibility score 



Figure 4: Distributions of (per-image) concreteness
scores across different models. Red line marks median
values, box edges correspond to 1st and 3rd quartiles, the
whiskers extend to the most extreme data points and
outliers are plotted individually.


for DIR_O phrases was 1.74 (s.d.: 0.76), for Dec it
was 2.48 (s.d.: 0.64), with the difference significant
according to a Mann-Whitney test. The two annotators
agreed in assigning the lowest score (“completely
implausible”) to more than 1/3 of the DIR_O
phrases (74/200; e.g., tinned tostada, animal bird,
hollow hyrax), but they unanimously assigned the
lowest score to only 7/200 Dec phrases (e.g., cylindrical
bed-sheet, sweet ramekin, wooden meat). We
thus have solid quantitative support that the superiority
of Dec is partially due to how it learns to jointly
account for adjective and noun semantics, producing
phrases that are linguistically more meaningful.
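The phrase-generation step of this experiment, and the resulting count of 3,654 phrases per model, can be reproduced with a small sketch. The example predictions and the function name are invented for illustration.

```python
from itertools import product


def candidate_phrases(top_adjectives, top_nouns, n=3):
    """Combine the top-n adjectives with the top-n nouns into adjective-noun phrases."""
    return [f"{adj} {noun}" for adj, noun in product(top_adjectives[:n], top_nouns[:n])]


# Hypothetical top predictions of one model for one image.
phrases = candidate_phrases(["white", "furry", "brown"], ["dog", "goat", "puppy"])

# 2 sampled images for each of the 203 objects, 9 phrases per image.
phrases_per_model = 2 * 203 * len(phrases)
print(len(phrases), phrases_per_model)
```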


Adjective concreteness We can gain further insight
into the nature of the adjectives chosen by
the models by considering the fact that phrases that
are meant to describe an object in a picture should
mostly contain concrete adjectives, and thus the degree
of concreteness of the adjectives produced by a
model is an indirect measure of its quality. Following
Hill and Korhonen (2014), we define the concreteness
of an adjective as the average concreteness
score of the nouns it modifies in our text corpus.
Noun concreteness scores are taken, in turn, from
Turney et al. (2011). For each test image and model,
we obtain a concreteness score by averaging the
concreteness of the top 5 adjectives that the model
selected for the image. Figure 4 reports the distributions
of the resulting scores across models. We confirm
that the purely language-based models (LM,
SP) are producing generic abstract adjectives that
are not appropriate to describe images (e.g., cryptographic
key, homemade bread, Greek salad, beaten
yolk). The image-informed vLM and DIR_O models
produce considerably more concrete adjectives. Not
surprisingly, DIR_A, which was directly trained on
concrete adjectives, produces the most concrete ones.
Importantly, Dec, despite being based on a cross-modal
function that was not explicitly exposed to
adjectives, produced adjectives that approach
the concreteness level of those of DIR_A (the differences
between Dec and DIR_O and between Dec and
DIR_A are both significant according to paired
Mann-Whitney tests).
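The per-image concreteness score can be sketched as follows. The sketch assumes that the nouns modified by each adjective are given as a list of noun types and that noun concreteness scores are available as a dictionary; all values and names below are hypothetical, and averaging over noun types rather than corpus tokens is an assumption.

```python
def adjective_concreteness(adjective, modified_nouns, noun_concreteness):
    """Average concreteness of the nouns an adjective modifies (nouns without a score are skipped)."""
    scores = [
        noun_concreteness[noun]
        for noun in modified_nouns[adjective]
        if noun in noun_concreteness
    ]
    return sum(scores) / len(scores)


def image_concreteness(ranked_adjectives, modified_nouns, noun_concreteness, top_n=5):
    """Average adjective concreteness over the top-n adjectives a model selected for an image."""
    top = ranked_adjectives[:top_n]
    return sum(
        adjective_concreteness(adj, modified_nouns, noun_concreteness) for adj in top
    ) / len(top)


# Toy scores, invented for illustration only.
noun_concreteness = {"dog": 0.9, "table": 0.8, "idea": 0.2, "plan": 0.3}
modified_nouns = {
    "furry": ["dog"],
    "wooden": ["table"],
    "perfect": ["idea", "plan", "table"],
}

score = image_concreteness(
    ["furry", "wooden", "perfect"], modified_nouns, noun_concreteness, top_n=2
)
print(score)
```

An abstract adjective such as perfect, which mostly modifies abstract nouns in this toy data, receives a lower score than furry or wooden.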


Object      Predicted attributes
aeroplane   thick, wet, dry, cylindrical, motionless, translucent
dog         cuddly, wild, cute, furry, white, coloured

Table 8: Two VOC images with some top attributes
assigned by Dec: these attributes, together with their
cosine similarities to the mapped image vectors, serve as
attribute-centric representations.



5 Using Dec for attribute-based object 
classification 

As discussed in the introduction, attributes can ef¬ 
fectively be used for attribute-based object clas¬ 
sification. In this section, we show that clas¬ 
sifiers trained on attribute representations created 
with Dec - which does not require any attribute- 
annotated training data nor training a battery of at¬ 
tribute classifiers - outperform (and are complemen¬ 
tary to) standard BoVW features. 

We use a subset of the Pascal VOC 2008 dataset.
Specifically, following Farhadi et al. (2009) , we use 
the original VOC training set for training/validation, 
and the VOC validation set for testing. One-vs-all 
linear-SVM classifiers are trained for all VOC ob¬ 
jects, using 3 alternative image representations. 

http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/

First, we train directly on BoVW features
(PHOW, see Section 2.3), as in the classic object
recognition pipeline. We compare PHOW to an
attribute-centric approach with attribute labels
automatically generated by Dec. All VOC images are
projected onto the linguistic space using the cross-modal
mapping function trained with object-noun
data only (see Section 4.1), from which we further
removed all images depicting a VOC object. Each
image projection is then decomposed through Dec
into two vectors representing adjective and noun
information. The final attribute-centric vector
representing an image is created by recording the cosine
similarities of the Dec-generated adjective vector
with all the adjectives in our linguistic space.
Informally, this representation can be thought of as a
vector of weights describing the appropriateness of each
adjective as an annotation for the image. This is
comparable to standard attribute-based classification
(Farhadi et al., 2009), in which images are
represented as distributions over attributes estimated with
a set of ad hoc supervised attribute-specific classifiers.
Table 8 shows examples of top attributes
automatically assigned by Dec. While not nearly as
accurate as manual annotation, many attributes are
relevant to the objects, both as specifically depicted in
the image (the aeroplane is wet), but also more
prototypically (aeroplanes are cylindrical in general).
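The construction of the attribute-centric vectors can be sketched as follows, including the sparsification mentioned in the accompanying footnote (dimensions whose cosine falls below the global mean cosine are set to zero). The sketch assumes the Dec-generated adjective vectors are already available; the two-dimensional toy vectors and all names are hypothetical, and the global mean is assumed to be taken over all images and adjectives.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attribute_representations(image_adjective_vectors, adjective_space):
    """Map each image to a sparse vector of cosines with every adjective in the space."""
    names = sorted(adjective_space)
    dense = [
        [cosine(vec, adjective_space[name]) for name in names]
        for vec in image_adjective_vectors
    ]
    # Global mean cosine over all images and all adjective dimensions.
    flat = [value for row in dense for value in row]
    global_mean = sum(flat) / len(flat)
    # Zero out every dimension that falls below the global mean.
    sparse = [[v if v >= global_mean else 0.0 for v in row] for row in dense]
    return names, sparse


# Toy linguistic space and toy adjective vectors for two images.
adjective_space = {"wet": [1.0, 0.0], "dry": [0.0, 1.0]}
image_adjective_vectors = [[2.0, 1.0], [0.0, 3.0]]

names, features = attribute_representations(image_adjective_vectors, adjective_space)
print(names, features)
```

The resulting rows can be fed to any standard classifier in place of, or alongside, low-level visual features.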

We also perform feature-level fusion (FUSED) by
concatenating the PHOW and Dec features, and
reducing the resulting vector to 100 dimensions with
SVD (Bruni et al., 2014) (SVD dimensionality
determined by cross-validation on the training set).


5.1 Results 

There is an improvement over PHOW visual features
when using Dec-based attribute vectors, with accuracy
rising from 30.49% to 32.76%. The confusion
matrices in Figure 5 show that PHOW and Dec do
not only differ in quantitative performance, but make
different kinds of errors, in part pointing at the
different modalities the two models tap into. PHOW,
for example, tends to confuse cats with sofas, probably
because the former are often pictured lying on
the latter. Dec, on the other hand, tends to confuse
chairs with TV monitors, partially misguided
by the taxonomic information encoded in language
(both are pieces of furniture). Indeed, the combined
Fused approach outperforms both representations
by a large margin (35.81%), confirming that the
linguistically-enriched information brought by Dec
is to a certain extent complementary to the lower-level
visual evidence directly exploited by PHOW.
Overall, the performance of our system is quite close
to the one obtained by Farhadi et al. (2009) with
ensembles of supervised attribute classifiers trained on
manually annotated data (the most comparable accuracy
from their Table 1 is at 34.3%).

Given that the resulting representations are very dense, we
sparsify them by setting to zero all adjective dimensions with
cosine below the global mean cosine value.

Farhadi and colleagues reduce the bias for the people category
by reporting mean per-class accuracy; we directly
excluded people from our version of the data set.


Figure 5: Confusion matrices for PHOW (top) and Dec
(bottom). Warmer-color cells correspond to higher
proportions of images with gold row label tagged by an
algorithm with the column label (e.g., the first cells show that
Dec tags a larger proportion of aeroplanes correctly).


6 Conclusion


We extended zero-shot image labeling beyond ob¬ 
jects, showing that it is possible to tag images with 
attribute-denoting adjectives that were not seen dur¬ 
ing training. For some attributes, performance was 
comparable to that of per-attribute supervised classi¬ 
fiers. We further showed that attributes are implicitly 
induced when learning to map visual vectors of ob¬ 
jects to their linguistic realizations as nouns, and that 
improvements in both attribute and noun retrieval 
are attained by treating images as visual phrases, 
whose linguistic representations must be decom¬ 
posed into a coherent word sequence. The resulting 
model outperformed a set of strong rivals. While the 
performance of the zero-shot decompositional ap¬ 
proach in the adjective-noun phrase labeling alone 
might still be low for practical applications, this 
model can still produce attribute-based representa¬ 
tions that significantly improve performance in a 
supervised object recognition task, when combined 
with standard visual features. 

By mapping attributes and objects to phrases in 
a linguistic space, we are also likely to produce 
more natural descriptions than those currently used 
in computer vision (fluffy kittens rather than 2-boxy 
tables). In future work, we want to delve more 
into the linguistic and pragmatic naturalness of at¬ 
tributes: Can we predict not just which attributes 
of a depicted object are true, but which are more 
salient and thus more likely to be mentioned (red 
car over metal car)? Can we pick the most appro¬ 
priate adjective to denote an attribute given the ob¬ 
ject in the picture (moist, rather than damp lips)? 
We should also address attribute dependencies: by
ignoring them, we currently get undesired results,
such as the aeroplane in Table 8 being tagged as both
wet and dry. More ambitiously, inspired by Karpathy
et al. (2014), we plan to associate image fragments
with phrases of arbitrary syntactic structures
(e.g., PPs for backgrounds, VPs for main events),
paving the way to full-fledged caption generation.


Acknowledgments 

We thank the TACL reviewers for their feedback. 
We were supported by ERC 2011 Starting Indepen¬ 
dent Research Grant n. 283554 (COMPOSES). 










References 

Tamara Berg, Alexander Berg, and Jonathan Shih. 2010. 
Automatic attribute discovery and characterization 
from noisy Web data. In Proceedings of ECCV, pages
663-676, Crete, Greece. 

Anna Bosch, Andrew Zisserman, and Xavier Munoz. 
2007. Image classification using random forests and 
ferns. In Proceedings of ICCV, pages 1-8, Rio de 
Janeiro, Brazil. 

Thorsten Brants, Ashok Popat, Peng Xu, Franz Och, and 
Jeffrey Dean. 2007. Large language models in ma¬ 
chine translation. In Proceedings of EMNLP, pages 
858-867, Prague, Czech Republic. 

Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. 
Multimodal distributional semantics. Journal of
Artificial Intelligence Research, 49:1-47.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, and
Li Fei-Fei. 2009. Imagenet: A large-scale hierarchi¬ 
cal image database. In Proceedings of CVPR, pages 
248-255, Miami Beach, FL. 

Georgiana Dinu and Marco Baroni. 2014. How to make 
words with vectors: Phrase generation in distributional 
semantics. In Proceedings of ACL, pages 624-633, 
Baltimore, MD. 

Santosh Divvala, Ali Farhadi, and Carlos Guestrin. 
2014. Learning everything about anything: Webly- 
supervised visual concept learning. In Proceedings of 
CVPR, Columbus, OH. 

Katrin Erk, Sebastian Pado, and Ulrike Pado. 2010. A 
flexible, corpus-driven model of regular and inverse 
selectional preferences. Computational Linguistics, 
36(4):723-763. 

Stefan Evert. 2005. The Statistics of Word Cooccur¬ 
rences. Ph.D dissertation, Stuttgart University. 

Ali Farhadi, Ian Endres, Derek Hoiem, and David 
Forsyth. 2009. Describing objects by their attributes. 
In Proceedings of CVPR, pages 1778-1785, Miami 
Beach, FL. 

Ali Farhadi, Mohsen Hejrati, Mohammad A. Sadeghi, 
Peter Young, Cyrus Rashtchian, Julia Hockenmaier, 
and David Forsyth. 2010. Every picture tells a story: 
Generating sentences from images. In Proceedings of 
ECCV, Crete, Greece. 

Vittorio Ferrari and Andrew Zisserman. 2007. Learning 
visual attributes. In Proceedings of NIPS, pages 433- 
440, Vancouver, Canada. 

Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio,
Jeff Dean, Marc’Aurelio Ranzato, and Tomas
Mikolov. 2013. DeViSE: A deep visual-semantic em¬ 
bedding model. In Proceedings of NIPS, pages 2121- 
2129, Lake Tahoe, NV. 

Yunchao Gong, Liwei Wang, Micah Hodosh, Julia
Hockenmaier, and Svetlana Lazebnik. 2014. Improving
image-sentence embeddings using large weakly
annotated photo collections. In Proceedings of ECCV,
pages 529-545, Zurich, Switzerland.

David R Hardoon, Sandor Szedmak, and John Shawe- 
Taylor. 2004. Canonical correlation analysis: An 
overview with application to learning methods. Neu¬ 
ral Computation, 16(12):2639-2664. 

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 
2009. The Elements of Statistical Learning, 2nd edi¬ 
tion. Springer, New York. 

Felix Hill and Anna Korhonen. 2014. Concreteness and 
subjectivity as dimensions of lexical meaning. In Pro¬ 
ceedings of ACL, pages 725-731, Baltimore, Mary¬ 
land. 

Micah Hodosh, Peter Young, and Julia Hockenmaier. 
2013. Framing image description as a ranking task: 
Data, models and evaluation metrics. Journal of Arti¬ 
ficial Intelligence Research, 47:853-899. 

Harold Hotelling. 1936. Relations between two sets of 
variates. Biometrika, 28(3/4):321-377. 

Mark Huiskes and Michael Lew. 2008. The MIR Flickr 
retrieval evaluation. In Proceedings of MIR, pages 39- 
43, New York, NY. 

Andrej Karpathy, Armand Joulin, and Li Fei-Fei. 2014. 
Deep fragment embeddings for bidirectional image 
sentence mapping. In Proceedings of NIPS, pages 
1097-1105, Montreal, Canada. 

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. 
2012. ImageNet classification with deep convolutional 
neural networks. In Proceedings of NIPS, pages 1097- 
1105, Lake Tahoe, Nevada. 

Alex Krizhevsky. 2009. Learning multiple layers of fea¬ 
tures from tiny images. Master’s thesis. 

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming 
Li, Yejin Choi, Alexander Berg, and Tamara Berg. 
2011. Baby talk: Understanding and generating sim¬ 
ple image descriptions. In Proceedings of CVPR, 
pages 1601-1608, Colorado Springs, CO. 

Christoph H Lampert, Hannes Nickisch, and Stefan 
Harmeling. 2009. Learning to detect unseen object 
classes by between-class attribute transfer. In Pro¬ 
ceedings of CVPR, pages 951-958, Miami Beach, FL. 

Angeliki Lazaridou, Elia Bruni, and Marco Baroni. 2014. 
Is this a wampimuk? cross-modal mapping between 
distributional semantics and the visual world. In Pro¬ 
ceedings of ACL, pages 1403-1414, Baltimore, MD. 

Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. 
2006. Beyond bags of features: Spatial pyramid 
matching for recognizing natural scene categories. In 
Proceedings of CVPR, pages 2169-2178, Washington, 
DC. 



Calvin MacKenzie. 2014. Integrating visual and linguis¬ 
tic information to describe properties of objects. Un¬ 
dergraduate Honors Thesis, Computer Science Depart¬ 
ment, University of Texas at Austin. 

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. 2013. Efficient estimation of word representations
in vector space. http://arxiv.org/abs/1301.3781/

Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa 
Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, 
Tamara Berg, Karl Stratos, and Hal Daume III. 2012. 
Midge: Generating image descriptions from computer 
vision detections. In Proceedings of EACL, pages 
747-756, Avignon, France. 

Gregory Murphy. 2002. The Big Book of Concepts. MIT 
Press, Cambridge, MA. 

Vicente Ordonez, Jia Deng, Yejin Choi, Alexander Berg, 
and Tamara Berg. 2013. From large scale image cate¬ 
gorization to entry-level categories. In Proceedings of 
ICCV, pages 1-8, Sydney, Australia. 

Genevieve Patterson, Chen Xu, Hang Su, and James 
Hays. 2014. The SUN attribute database: Beyond cat¬ 
egories for deeper scene understanding. International 
Journal of Computer Vision, 108(1-2):59-81.

Adam Pauls and Dan Klein. 2012. Large-scale syntactic 
language modeling with treelets. In Proceedings of 
ACL, pages 959-968, Jeju Island, Korea. 

Olga Russakovsky and Li Fei-Fei. 2010. Attribute learn¬ 
ing in large-scale datasets. In Proceedings of ECCV, 
pages 1-14. 

Mohammad Sadeghi and Ali Farhadi. 2011. Recognition 
using visual phrases. In Proceedings of CVPR, pages
1745-1752, Colorado Springs, CO. 

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 
2013. Models of semantic representation with visual 
attributes. In Proceedings of ACL, pages 572-582, 
Sofia, Bulgaria.

Josef Sivic and Andrew Zisserman. 2003. Video Google: 
A text retrieval approach to object matching in videos. 
In Proceedings of ICCV, pages 1470-1477, Nice, 
France. 

Richard Socher, Milind Ganjoo, Christopher Manning, 
and Andrew Ng. 2013. Zero-shot learning through 
cross-modal transfer. In Proceedings of NIPS, pages 
935-943, Lake Tahoe, NV. 

Peter Turney and Patrick Pantel. 2010. From frequency 
to meaning: Vector space models of semantics. Jour¬ 
nal of Artificial Intelligence Research, 37:141-188. 

Peter Turney, Yair Neuman, Dan Assaf, and Yohai Cohen.
2011. Literal and metaphorical sense identification
through concrete and abstract context. In
Proceedings of EMNLP, pages 680-690, Edinburgh, UK.


Laurens Van der Maaten and Geoffrey Hinton. 2008. 
Visualizing data using t-SNE. Journal of Machine 
Learning Research, 9:2579-2605.

Andrea Vedaldi and Brian Fulkerson. 2010. VLFeat - 
an open and portable library of computer vision al¬ 
gorithms. In Proceedings of ACM Multimedia, pages 
1469-1472, Firenze, Italy. 

Andrea Vedaldi, Siddarth Mahendran, Stavros Tsogkas,
Subhransu Maji, Ross Girshick, Juho Kannala, Esa
Rahtu, Iasonas Kokkinos, Matthew Blaschko, David
Weiss, Ben Taskar, Karen Simonyan, Naomi Saphra,
and Sammy Mohamed. 2014. Understanding objects
in detail with fine-grained attributes. In Proceedings
of CVPR, Columbus, OH.


